Table of Contents

Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

We have to build a model that helps the marketing department identify the potential customers with a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan or not.

Key questions to be answered

  1. What are the factors that influence a liability customer to buy a personal loan?
  2. Which segment of customers should be targeted more?

Data Description

The data contains different attributes of liability customers of AllLife Bank. The detailed data dictionary is given below.

Data Dictionary

Load and explore the dataset

Import the necessary packages

Read the dataset

Understand the shape of the dataset.

Check the data types of the columns for the dataset.

Observations

View the first and last 5 rows and sample of the dataset.

Observation:

Check for missing values

Check for duplicate values
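The load-and-explore steps above can be sketched as follows. The CSV filename is hypothetical, and the tiny frame below is a synthetic stand-in so the snippet is self-contained; in the notebook these calls run on the full dataset.

```python
import pandas as pd

# df = pd.read_csv("Loan_Modelling.csv")  # hypothetical filename
df = pd.DataFrame({
    "Age": [35, 45, 39, 35],
    "Experience": [10, 19, 15, 10],
    "Income": [49, 34, 11, 49],
    "Personal_Loan": [0, 0, 1, 0],
})

print(df.shape)               # number of rows and columns
print(df.dtypes)              # data type of each column
print(df.head())              # first rows of the dataset
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
```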

Data Preprocessing

Feature Engineering

Observation:

Exploratory Data Analysis

Summary of the dataset.

Observations:

  1. Experience has a minimum value of -3. This is not possible, as experience is measured in years and cannot be negative.
  2. Annual income has a wide range, from 8 ($8,000) to 224 ($224,000).
  3. Average monthly credit card spending has a minimum of 0 and a maximum of 10 ($10,000), while the 75th percentile is only 2.5 ($2,500). Given the large gap between the 75th percentile and the maximum, we'll have to check whether the maximum is an outlier.
  4. The value of the house mortgage ranges from $0 to $635,000, with the 75th percentile at $101,000. We'll have to check whether the maximum is an outlier.
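A minimal sketch of one way to handle the impossible negative Experience values: treat them as sign errors and take the absolute value (clipping at zero or imputing from similarly aged customers are alternatives). The frame here is a synthetic stand-in.

```python
import pandas as pd

# Synthetic stand-in rows; the real data has a minimum Experience of -3.
df = pd.DataFrame({"Age": [25, 30, 45], "Experience": [-3, 5, 20]})

# One possible fix: treat negative entries as sign errors and take |x|.
df["Experience"] = df["Experience"].abs()
print(df["Experience"].min())  # no negative values remain
```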

Observations:

Observations:

Observations:

Observations:

Univariate Analysis

Observations on Age

Observations:

Observations on Experience

Observations:

Observations on Income

Observations:

Observations on CCAvg

Observations:

Observations on Mortgage

Observations

Observations on Personal Loan

Observations on Family of the Customers

Observations:

The observations below are made with the assumption that the customer is the head of the family.

Observations on Education

Observations:

Observations on County

Observations:

Zooming into this plot gives us the information below.

Observations on Securities Account

Observations on Certificate of Deposit account

Observations on Online (Internet Banking)

Observations on Credit Card from other banks

Bivariate Analysis

Correlation between numerical variables
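The correlation check can be sketched on a synthetic stand-in frame; in the notebook this would typically be rendered as a seaborn heatmap over all numerical columns.

```python
import pandas as pd

# Synthetic stand-in rows; Experience tracks Age almost exactly here,
# as it does in the real data.
df = pd.DataFrame({
    "Age":        [25, 35, 45, 55],
    "Experience": [1, 11, 21, 31],
    "Income":     [40, 60, 80, 55],
})
corr = df.corr()
print(corr.round(2))
```

Because Age and Experience are so strongly correlated, one of them adds little extra information to a model.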

Observations:

Observations:

Zooming into these plots gives us the information below.

Let's check the variation in Personal Loan with some of the other variables.

Personal Loan vs Family

Personal Loan vs Education

Personal Loan vs County

Personal Loan vs Securities Account

Personal Loan vs Certificate of Deposit Account

Personal Loan vs Internet Banking

Personal Loan vs Credit Card from other Banks

Personal Loan vs Age

Personal Loan vs Experience

Personal Loan vs Mortgage

Personal Loan vs Income

Personal Loan vs Credit Card Average

Model Building

Approach

Logistic Regression

  1. Data preparation
  2. Partition the data into train and test set.
  3. Build the logistic regression model
  4. Improve the model performance by changing the model threshold using AUC-ROC Curve
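The four steps above can be sketched as follows, using synthetic data in place of the bank's dataset (sample size, class balance, and random seeds are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Step 1: synthetic, imbalanced stand-in for the prepared bank data
# (~10% positives, roughly matching the ~9% historical conversion rate).
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.9, 0.1], random_state=1)

# Step 2: stratified train/test split to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Step 3: fit the logistic regression model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 4: score with AUC-ROC before tuning the decision threshold.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```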

Decision Trees

  1. Data preparation
  2. Partition the data into train and test set.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting a customer will buy a personal loan when in reality the customer will not - no loss.
  2. Predicting a customer will NOT buy a personal loan when in reality the customer would buy - loss of opportunity.

Which loss is greater?

How do we reduce this loss, i.e., reduce false negatives?

Split Data

Logistic Regression

Helper Functions

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
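A minimal sketch of such helper functions (the function names are illustrative, and the demo data at the end is synthetic):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score, confusion_matrix)

def model_performance(model, X, y):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy":  [accuracy_score(y, pred)],
        "Recall":    [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1":        [f1_score(y, pred)],
    })

def confusion(model, X, y):
    """Return the confusion matrix as a labelled DataFrame."""
    cm = confusion_matrix(y, model.predict(X))
    return pd.DataFrame(cm, index=["Actual 0", "Actual 1"],
                        columns=["Predicted 0", "Predicted 1"])

# Tiny demo on synthetic stand-in data:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, weights=[0.85, 0.15],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(model_performance(clf, X, y))
print(confusion(clf, X, y))
```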

Build Logistic Regression Model

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds

Coefficient interpretations

Interpretation for other attributes can be made similarly.
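The conversion itself is just exponentiation; a sketch with hypothetical coefficient values (not the fitted model's actual coefficients):

```python
import numpy as np

# Hypothetical coefficients from a fitted logistic regression.
coefs = np.array([0.05, -0.70, 1.20])
odds = np.exp(coefs)
print(odds.round(3))
```

A one-unit increase in a feature multiplies the odds of taking a personal loan by exp(coefficient): values above 1 increase the odds, values below 1 decrease them.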

Checking model performance on training set

ROC-AUC

Model Performance Improvement using ROC-AUC curve

Optimal threshold using AUC-ROC curve
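One common way to pick a threshold from the ROC curve is to maximize Youden's J statistic (tpr - fpr); a sketch on toy labels and scores standing in for the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (stand-ins for model output).
y_true  = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
optimal = thresholds[np.argmax(tpr - fpr)]  # Youden's J statistic
print(optimal)
```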

Checking model performance on training set

Model Performance Improvement using Precision-Recall curve

Optimal threshold using Precision-Recall curve
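From the precision-recall curve, one common choice is the threshold that maximizes F1 (the harmonic mean of precision and recall); again a sketch on toy stand-in scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities (stand-ins for model output).
y_true  = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

prec, rec, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have one more entry than thresholds; drop the last point.
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
optimal = thresholds[np.argmax(f1)]
print(optimal)
```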

Checking model performance on training set

Model Performance Summary - Training

Let's check the performance on the test set

Using the model with default threshold

ROC-AUC on test set

Using the model with threshold of 0.123

Using the model with threshold 0.32

Model performance comparison

Sequential Feature Selector

Selecting subset of important features using Sequential Feature Selector method
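scikit-learn's SequentialFeatureSelector is one implementation of this method (mlxtend's SFS is a similar alternative). The notebook keeps the best 8 variables; the sketch below selects 4 from 10 synthetic features for speed:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 10 candidate features.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Greedy forward selection, scored by cross-validation at each step.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```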

Finding which features are important

Let's look at the best 8 variables

Fitting the logistic regression model

Let's look at model performance

Model Performance Comparison

Conclusion - Logistic Regression

Decision Trees

Helper Functions

Build Decision Tree Model

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting using GridSearch for hyperparameter tuning
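A sketch of the tuning step on synthetic stand-in data; the parameter grid and seeds are illustrative choices, and recall is used as the scoring metric because false negatives (missed loan buyers) are the costlier error here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data and a stratified split.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Grid of pre-pruning hyperparameters, scored on recall.
params = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    params, scoring="recall", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```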

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting using Cost Complexity Pruning

Cost Complexity Pruning

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
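The procedure just described can be sketched as follows on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1],
                           random_state=1)

# Compute the effective alphas of the minimal cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Fit one tree per alpha; a larger alpha prunes more aggressively.
clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]

# The largest alpha prunes the whole tree down to a single node.
trivial_nodes = clfs[-1].tree_.node_count
print(trivial_nodes)

# Drop that trivial tree before comparing recall across alphas.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
```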

Recall vs alpha for training and testing sets

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusion - Decision Trees

Logistic Regression, Decision Tree Models Comparison

Conclusion:

Based on the above reasons, we should use the post-pruned Decision Tree model.

Business Insights